OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Distances between Distributions: Comparing Language Models

Internal identifier: 001545 (Main/Exploration); previous: 001544; next: 001546

Authors: Thierry Murgue [France]; Colin De La Higuera [France]

Source: Lecture Notes in Computer Science (ISSN 0302-9743), 2004

RBID : ISTEX:27187A35B4B9CB57D3CC0D83A32284926A6E9184

Abstract

Language models are used in a variety of fields in order to support other tasks: classification, next-symbol prediction, pattern analysis. In order to compare language models, or to measure the quality of an acquired model with respect to an empirical distribution, or to evaluate the progress of a learning process, we propose to use distances based on the L2 norm, or quadratic distances. We prove that these distances can not only be estimated through sampling, but can be effectively computed when both distributions are represented by stochastic deterministic finite automata. We provide a set of experiments showing a fast convergence of the distance through sampling and a good scalability, enabling us to use this distance to decide if two distributions are equal when only samples are provided, or to classify texts.
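
As an illustration of the sampling-based estimate mentioned in the abstract, the following is a minimal sketch, not the authors' method. It assumes the quadratic distance is the usual Euclidean one, the square root of the sum over all strings w of (P(w) - Q(w))^2, and evaluates it on the empirical distributions of two samples. The function names and the toy samples are hypothetical; the exact computation on stochastic deterministic finite automata described in the paper is not reproduced here.

```python
from collections import Counter
from math import sqrt


def empirical_distribution(sample):
    """Map each distinct string to its relative frequency in the sample."""
    if not sample:
        raise ValueError("sample must contain at least one string")
    counts = Counter(sample)
    total = len(sample)
    return {w: c / total for w, c in counts.items()}


def l2_distance(p, q):
    """Euclidean (L2) distance between two distributions given as dicts.

    Strings missing from one distribution are treated as having
    probability 0 under it.
    """
    support = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in support))


# Hypothetical toy samples drawn from two unknown string distributions.
sample_a = ["ab", "ab", "a", "b"]
sample_b = ["ab", "a", "a", "b"]

p_hat = empirical_distribution(sample_a)
q_hat = empirical_distribution(sample_b)
d = l2_distance(p_hat, q_hat)
print(f"estimated L2 distance: {d:.4f}")
```

With larger samples the empirical frequencies approach the true probabilities, so the estimate approaches the distance between the underlying distributions. How fast it does so is what the paper's experiments measure.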

Url: https://api.istex.fr/document/27187A35B4B9CB57D3CC0D83A32284926A6E9184/fulltext/pdf
DOI: 10.1007/978-3-540-27868-9_28


Affiliations: France (country); Auvergne-Rhône-Alpes, Rhône-Alpes (regions); Saint-Etienne (city)



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Distances between Distributions: Comparing Language Models</title>
<author>
<name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
</author>
<author>
<name sortKey="De La Higuera, Colin" sort="De La Higuera, Colin" uniqKey="De La Higuera C" first="Colin" last="De La Higuera">Colin De La Higuera</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:27187A35B4B9CB57D3CC0D83A32284926A6E9184</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-27868-9_28</idno>
<idno type="url">https://api.istex.fr/document/27187A35B4B9CB57D3CC0D83A32284926A6E9184/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001454</idno>
<idno type="wicri:Area/Istex/Curation">001368</idno>
<idno type="wicri:Area/Istex/Checkpoint">000D80</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Murgue T:distances:between:distributions</idno>
<idno type="wicri:Area/Main/Merge">001596</idno>
<idno type="wicri:Area/Main/Curation">001545</idno>
<idno type="wicri:Area/Main/Exploration">001545</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Distances between Distributions: Comparing Language Models</title>
<author>
<name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>RIM, Ecole des Mines de Saint-Etienne, 158, Cours Fauriel, 42023, Saint-Etienne cedex 2</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Saint-Etienne</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>EURISE, University of Saint-Etienne, 23 rue du Dr Paul Michelon, 42023, Saint-Etienne cedex 2</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Saint-Etienne</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
<author>
<name sortKey="De La Higuera, Colin" sort="De La Higuera, Colin" uniqKey="De La Higuera C" first="Colin" last="De La Higuera">Colin De La Higuera</name>
<affiliation wicri:level="3">
<country xml:lang="fr">France</country>
<wicri:regionArea>EURISE, University of Saint-Etienne, 23 rue du Dr Paul Michelon, 42023, Saint-Etienne cedex 2</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Auvergne-Rhône-Alpes</region>
<region type="old region" nuts="2">Rhône-Alpes</region>
<settlement type="city">Saint-Etienne</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">France</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2004</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">27187A35B4B9CB57D3CC0D83A32284926A6E9184</idno>
<idno type="DOI">10.1007/978-3-540-27868-9_28</idno>
<idno type="ChapterID">28</idno>
<idno type="ChapterID">Chap28</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Language models are used in a variety of fields in order to support other tasks: classification, next-symbol prediction, pattern analysis. In order to compare language models, or to measure the quality of an acquired model with respect to an empirical distribution, or to evaluate the progress of a learning process, we propose to use distances based on the L2 norm, or quadratic distances. We prove that these distances can not only be estimated through sampling, but can be effectively computed when both distributions are represented by stochastic deterministic finite automata. We provide a set of experiments showing a fast convergence of the distance through sampling and a good scalability, enabling us to use this distance to decide if two distributions are equal when only samples are provided, or to classify texts.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Auvergne-Rhône-Alpes</li>
<li>Rhône-Alpes</li>
</region>
<settlement>
<li>Saint-Etienne</li>
</settlement>
</list>
<tree>
<country name="France">
<region name="Auvergne-Rhône-Alpes">
<name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
</region>
<name sortKey="De La Higuera, Colin" sort="De La Higuera, Colin" uniqKey="De La Higuera C" first="Colin" last="De La Higuera">Colin De La Higuera</name>
<name sortKey="Murgue, Thierry" sort="Murgue, Thierry" uniqKey="Murgue T" first="Thierry" last="Murgue">Thierry Murgue</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001545 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001545 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:27187A35B4B9CB57D3CC0D83A32284926A6E9184
   |texte=   Distances between Distributions: Comparing Language Models
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024